mean_squared_error (MSE)#
Mean squared error (MSE) is one of the most common regression metrics and training losses. It measures the average squared difference between targets and predictions.
Learning goals#
Understand the definition, units, and common variants (weights, multioutput)
Build intuition for why squaring emphasizes large errors (outliers)
Implement MSE from scratch in NumPy
Visualize the loss landscape for a simple model
Use MSE to fit a linear regression model via gradient descent
Prerequisites#
Basic NumPy arrays and broadcasting
Derivatives of polynomials (or comfort with gradients)
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.metrics import mean_squared_error
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) Definition#
Let \(y \in \mathbb{R}^n\) be targets and \(\hat{y} \in \mathbb{R}^n\) predictions.
Define the residual \(r_i = y_i - \hat{y}_i\). Then

\[\mathrm{MSE}(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^{n} r_i^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2.\]

In vector form:

\[\mathrm{MSE}(y, \hat{y}) = \frac{1}{n}\lVert y - \hat{y} \rVert_2^2.\]
Weighted MSE#
If some samples matter more (or have different noise levels), use non-negative weights \(w_i\):

\[\mathrm{MSE}_w(y, \hat{y}) = \frac{\sum_{i=1}^{n} w_i (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} w_i}.\]
Multioutput#
For \(y,\hat{y} \in \mathbb{R}^{n\times d}\) (\(d\) targets per sample), compute per-output MSE:

\[\mathrm{MSE}_j = \frac{1}{n}\sum_{i=1}^{n} (y_{ij} - \hat{y}_{ij})^2, \qquad j = 1, \dots, d,\]

and then aggregate across outputs (uniform average or a weighted average).
Units#
If \(y\) is measured in meters, then MSE is measured in meters². For interpretability, people often report \(\mathrm{RMSE}=\sqrt{\mathrm{MSE}}\) in meters.
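As a quick sanity check on units, here is a minimal sketch (the measurement values are made up for illustration):

```python
import numpy as np

# Hypothetical measurements in meters
y_true = np.array([10.0, 12.0, 9.5])   # meters
y_pred = np.array([11.0, 11.5, 10.0])  # meters

mse = float(np.mean((y_true - y_pred) ** 2))  # units: meters^2
rmse = float(np.sqrt(mse))                    # units: meters

print(f"MSE  = {mse:.4f} m^2")   # 0.5000 m^2
print(f"RMSE = {rmse:.4f} m")    # 0.7071 m
```

RMSE lives on the same scale as the target, which is why it is the more interpretable number to report.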
y_true = np.array([2.0, 0.0, 4.0, 1.0])
y_pred = np.array([1.5, 0.2, 3.0, 2.0])
residual = y_true - y_pred
sq_error = residual**2
mse = float(np.mean(sq_error))
print("y_true =", y_true)
print("y_pred =", y_pred)
print("residual =", residual)
print("(residual)^2 =", sq_error)
print("MSE (NumPy) =", mse)
print("MSE (sklearn)=", mean_squared_error(y_true, y_pred))
y_true = [2. 0. 4. 1.]
y_pred = [1.5 0.2 3. 2. ]
residual = [ 0.5 -0.2 1. -1. ]
(residual)^2 = [0.25 0.04 1. 1. ]
MSE (NumPy) = 0.5725
MSE (sklearn)= 0.5725
idx = np.arange(len(y_true))
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=("Residuals (y - ŷ)", "Contribution to MSE: (y - ŷ)²"),
)
fig.add_trace(
go.Scatter(x=idx, y=y_true, mode="markers", name="y_true", marker=dict(size=10)),
row=1,
col=1,
)
fig.add_trace(
go.Scatter(x=idx, y=y_pred, mode="markers", name="y_pred", marker=dict(size=10)),
row=1,
col=1,
)
for i in idx:
fig.add_trace(
go.Scatter(
x=[i, i],
y=[y_pred[i], y_true[i]],
mode="lines",
line=dict(color="rgba(0,0,0,0.35)", width=2),
showlegend=False,
),
row=1,
col=1,
)
fig.add_trace(go.Bar(x=idx, y=sq_error, name="(y - ŷ)²"), row=1, col=2)
fig.update_xaxes(title_text="sample index", row=1, col=1)
fig.update_yaxes(title_text="value", row=1, col=1)
fig.update_xaxes(title_text="sample index", row=1, col=2)
fig.update_yaxes(title_text="squared error", row=1, col=2)
fig.update_layout(title=f"MSE = mean((y - ŷ)²) = {mse:.3f}", legend=dict(orientation="h"))
fig.show()
2) Why do we square the errors?#
For a single residual \(r\), the squared loss is:

\[\ell(r) = r^2.\]
Key properties:
Emphasizes large errors: doubling \(|r|\) multiplies the penalty by \(4\).
Smooth and differentiable: useful for gradient-based optimization.
For many models (e.g. linear regression), the MSE objective is convex.
The derivative is simple: \(\frac{d}{dr}\, r^2 = 2r\).

For MSE, the derivative with respect to a prediction \(\hat{y}_i\) is:

\[\frac{\partial\, \mathrm{MSE}}{\partial \hat{y}_i} = -\frac{2}{n}\,(y_i - \hat{y}_i) = -\frac{2}{n}\, r_i.\]
So the gradient magnitude grows linearly with the residual: big mistakes produce big updates.
A common alternative is MAE (mean absolute error), which is more robust to outliers but is not differentiable at \(r=0\).
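A quick numerical sanity check of the gradient formula \(\partial\,\mathrm{MSE}/\partial \hat{y}_i = -\tfrac{2}{n} r_i\), using central finite differences (the data here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=5)
y_hat = rng.normal(size=5)
n = y.size

# Analytic gradient of MSE w.r.t. each prediction: -(2/n) * (y - y_hat)
grad_analytic = -(2.0 / n) * (y - y_hat)

# Central finite differences as an independent check
eps = 1e-6
grad_fd = np.empty(n)
for i in range(n):
    plus, minus = y_hat.copy(), y_hat.copy()
    plus[i] += eps
    minus[i] -= eps
    grad_fd[i] = (np.mean((y - plus) ** 2) - np.mean((y - minus) ** 2)) / (2 * eps)

print(np.allclose(grad_analytic, grad_fd, atol=1e-6))  # True
```

Because MSE is quadratic in \(\hat{y}\), the central difference matches the analytic gradient up to floating-point rounding.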
r = np.linspace(-5, 5, 400)
fig = go.Figure()
fig.add_trace(go.Scatter(x=r, y=r**2, mode="lines", name="Squared loss r²"))
fig.add_trace(go.Scatter(x=r, y=np.abs(r), mode="lines", name="Absolute loss |r|"))
fig.update_layout(
title="How loss grows with the residual r",
xaxis_title="residual r",
yaxis_title="loss",
)
fig.show()
3) Probabilistic view: why MSE is ‘natural’ for Gaussian noise#
Assume a regression model predicts the conditional mean and errors are Gaussian:

\[y_i = \hat{y}_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \text{ i.i.d.}\]

The negative log-likelihood (up to constants that don’t depend on \(\hat{y}\)) is:

\[-\log p(y \mid \hat{y}) = \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \text{const}.\]
So minimizing squared error is equivalent to maximum likelihood estimation under i.i.d. Gaussian noise.
If different points have different noise levels (heteroskedasticity), a weighted MSE (weighted least squares) can be more appropriate.
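A small numeric illustration of this equivalence, assuming a fixed, known noise level \(\sigma\) (the candidate predictions below are arbitrary): for fixed \(\sigma\), the Gaussian NLL is an affine function of the MSE, so both rank predictions identically.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=100)
sigma = 0.7  # assumed known, fixed noise level

def gaussian_nll(y, y_hat, sigma):
    """Negative log-likelihood of y under i.i.d. N(y_hat, sigma^2) noise."""
    n = y.size
    sse = np.sum((y - y_hat) ** 2)
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + sse / (2 * sigma**2)

# Two candidate prediction vectors
y_hat_a = np.zeros_like(y)
y_hat_b = np.full_like(y, 0.5)

mse_a, mse_b = np.mean((y - y_hat_a) ** 2), np.mean((y - y_hat_b) ** 2)
nll_a, nll_b = gaussian_nll(y, y_hat_a, sigma), gaussian_nll(y, y_hat_b, sigma)

# NLL = const + n * MSE / (2 sigma^2), so the orderings agree
print((mse_a < mse_b) == (nll_a < nll_b))  # True
```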
4) Mean squared error from scratch (NumPy)#
Below is a small implementation that mirrors the core ideas in sklearn.metrics.mean_squared_error:

- works for 1D targets (n,) and multioutput targets (n, d)
- optional sample_weight for weighted MSE
- multioutput aggregation: 'raw_values', 'uniform_average', or an array of output weights

This is intentionally “low-level”: no pandas, no sklearn internals.
def mean_squared_error_np(
y_true,
y_pred,
*,
sample_weight=None,
multioutput="uniform_average",
):
'''Compute mean squared error (MSE) with optional sample weights and multioutput support.
Parameters
----------
y_true, y_pred:
Array-like with shape (n,) or (n, d)
sample_weight:
None or array-like with shape (n,)
multioutput:
- 'raw_values' -> return array of shape (d,)
- 'uniform_average' -> return scalar (mean over outputs)
- array-like (d,) -> weighted average over outputs
Returns
-------
float or np.ndarray
'''
y_true_arr = np.asarray(y_true, dtype=float)
y_pred_arr = np.asarray(y_pred, dtype=float)
if y_true_arr.shape != y_pred_arr.shape:
raise ValueError(f"Shape mismatch: y_true {y_true_arr.shape} vs y_pred {y_pred_arr.shape}")
if y_true_arr.ndim == 1:
y_true_arr = y_true_arr[:, None]
y_pred_arr = y_pred_arr[:, None]
elif y_true_arr.ndim != 2:
raise ValueError("y_true/y_pred must be 1D or 2D")
n_samples, n_outputs = y_true_arr.shape
sq_error = (y_true_arr - y_pred_arr) ** 2 # (n, d)
if sample_weight is not None:
w = np.asarray(sample_weight, dtype=float)
if w.ndim != 1 or w.shape[0] != n_samples:
raise ValueError(f"sample_weight must have shape ({n_samples},), got {w.shape}")
if np.any(w < 0):
raise ValueError("sample_weight must be non-negative")
denom = float(np.sum(w))
if denom == 0.0:
raise ValueError("sum(sample_weight) must be > 0")
mse_per_output = np.sum(sq_error * w[:, None], axis=0) / denom
else:
mse_per_output = np.mean(sq_error, axis=0)
if isinstance(multioutput, str):
if multioutput == "raw_values":
return mse_per_output
if multioutput == "uniform_average":
return float(np.mean(mse_per_output))
raise ValueError(
"multioutput must be 'raw_values', 'uniform_average', or array-like of output weights"
)
output_weights = np.asarray(multioutput, dtype=float)
if output_weights.shape != (n_outputs,):
raise ValueError(f"multioutput weights must have shape ({n_outputs},), got {output_weights.shape}")
weight_sum = float(np.sum(output_weights))
if weight_sum == 0.0:
raise ValueError("sum(multioutput weights) must be > 0")
return float(np.dot(mse_per_output, output_weights) / weight_sum)
# Sanity checks vs sklearn
# 1D
y_true_1d = rng.normal(size=50)
y_pred_1d = y_true_1d + rng.normal(scale=0.5, size=50)
print(
"1D equal:",
np.isclose(mean_squared_error_np(y_true_1d, y_pred_1d), mean_squared_error(y_true_1d, y_pred_1d)),
)
# Multioutput
y_true_2d = rng.normal(size=(50, 3))
y_pred_2d = y_true_2d + rng.normal(scale=0.5, size=(50, 3))
mse_raw_np = mean_squared_error_np(y_true_2d, y_pred_2d, multioutput="raw_values")
mse_raw_sk = mean_squared_error(y_true_2d, y_pred_2d, multioutput="raw_values")
print("multioutput raw close:", np.allclose(mse_raw_np, mse_raw_sk))
# Weighted
w = rng.uniform(0.1, 2.0, size=50)
mse_w_np = mean_squared_error_np(y_true_2d, y_pred_2d, sample_weight=w, multioutput="uniform_average")
mse_w_sk = mean_squared_error(y_true_2d, y_pred_2d, sample_weight=w, multioutput="uniform_average")
print("weighted uniform close:", np.isclose(mse_w_np, mse_w_sk))
# Weighted multioutput
out_w = np.array([0.2, 0.3, 0.5])
mse_wo_np = mean_squared_error_np(y_true_2d, y_pred_2d, sample_weight=w, multioutput=out_w)
mse_wo_sk = mean_squared_error(y_true_2d, y_pred_2d, sample_weight=w, multioutput=out_w)
print("weighted output close:", np.isclose(mse_wo_np, mse_wo_sk))
1D equal: True
multioutput raw close: True
weighted uniform close: True
weighted output close: True
5) Geometry intuition: constant predictions#
Suppose your model can only predict a single constant value \(a\) for every sample: \(\hat{y}_i = a\).

The MSE becomes a function of a single parameter:

\[J(a) = \frac{1}{n}\sum_{i=1}^{n} (y_i - a)^2.\]

Differentiate and set to zero:

\[J'(a) = -\frac{2}{n}\sum_{i=1}^{n} (y_i - a) = 0 \quad\Longrightarrow\quad a^* = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}.\]

So the constant that minimizes MSE is the mean.
The best constant MSE equals the variance#
At the optimum \(a=\bar{y}\), the minimum value is:

\[J(\bar{y}) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2 = \mathrm{Var}(y).\]
So predicting the mean gives an MSE equal to the (population) variance of the target.
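Both facts can be verified directly; the grid resolution below is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(loc=3.0, scale=2.0, size=1000)

# At a = mean(y), the constant predictor's MSE equals Var(y) by definition
mse_at_mean = np.mean((y - y.mean()) ** 2)
print(np.isclose(mse_at_mean, np.var(y)))  # True

# And no grid candidate beats (approximately) the mean
a_grid = np.linspace(y.min(), y.max(), 2001)
mse_grid = np.mean((y[:, None] - a_grid[None, :]) ** 2, axis=0)
a_best = a_grid[np.argmin(mse_grid)]
print(abs(a_best - y.mean()) <= (a_grid[1] - a_grid[0]))  # True
```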
Weighted MSE: the weighted mean#
If we use weights \(w_i \ge 0\):

\[J_w(a) = \frac{\sum_i w_i (y_i - a)^2}{\sum_i w_i},\]

then the minimizer is the weighted mean:

\[a^* = \frac{\sum_i w_i y_i}{\sum_i w_i}.\]
Interpreting \(w_i \propto 1/\sigma_i^2\) connects this to weighted least squares under heteroskedastic Gaussian noise.
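A quick check that the weighted mean is indeed the minimizer (the weights here are drawn arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(size=200)
w = rng.uniform(0.1, 2.0, size=200)

a_star = np.sum(w * y) / np.sum(w)  # claimed minimizer: weighted mean

def wmse(a):
    """Weighted MSE of the constant predictor a."""
    return np.sum(w * (y - a) ** 2) / np.sum(w)

# The weighted MSE at the weighted mean beats nearby candidates
for delta in (-0.1, -0.01, 0.01, 0.1):
    assert wmse(a_star) < wmse(a_star + delta)
print("weighted mean minimizes WMSE")
```

Since \(J_w\) is a quadratic in \(a\) with positive leading coefficient, this local comparison is enough to confirm a strict minimum.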
This is also why MSE is sensitive to outliers: a single extreme value can move the mean (and therefore the optimum) a lot. If you have a reason to trust some points less, sample weights can reduce their influence.
y = rng.normal(loc=0.0, scale=1.0, size=60)
y_outlier = y.copy()
y_outlier[0] = 8.0 # one extreme point
a_grid = np.linspace(-4, 9, 500)
mse_clean = np.mean((y[:, None] - a_grid[None, :]) ** 2, axis=0)
mse_out = np.mean((y_outlier[:, None] - a_grid[None, :]) ** 2, axis=0)
mu_clean = float(np.mean(y))
mu_out = float(np.mean(y_outlier))
mse_at_mu_clean = float(np.mean((y - mu_clean) ** 2))
mse_at_mu_out = float(np.mean((y_outlier - mu_out) ** 2))
# Downweight the outlier (one simple way to reduce its influence)
w_down = np.ones_like(y_outlier)
w_down[0] = 0.05
wmse_down = (
np.sum(((y_outlier[:, None] - a_grid[None, :]) ** 2) * w_down[:, None], axis=0) / float(np.sum(w_down))
)
mu_w = float(np.sum(w_down * y_outlier) / np.sum(w_down))
wmse_at_mu_w = float(np.sum(((y_outlier - mu_w) ** 2) * w_down) / np.sum(w_down))
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=("MSE(a) is minimized at the mean", "Weighted MSE can reduce outlier influence"),
)
# Left: clean data vs the same data with an outlier
fig.add_trace(go.Scatter(x=a_grid, y=mse_clean, mode="lines", name="MSE(a) (clean)"), row=1, col=1)
fig.add_trace(go.Scatter(x=a_grid, y=mse_out, mode="lines", name="MSE(a) (with outlier)"), row=1, col=1)
fig.add_vline(x=mu_clean, line_dash="dash", line_color="rgba(0,0,0,0.45)", row=1, col=1)
fig.add_vline(x=mu_out, line_dash="dash", line_color="rgba(0,0,0,0.45)", row=1, col=1)
fig.add_annotation(
x=mu_clean,
y=mse_at_mu_clean,
text=f"mean={mu_clean:.2f}<br>min MSE=Var(y)≈{mse_at_mu_clean:.2f}",
showarrow=True,
yshift=10,
row=1,
col=1,
)
fig.add_annotation(
x=mu_out,
y=mse_at_mu_out,
text=f"mean={mu_out:.2f}",
showarrow=True,
yshift=10,
row=1,
col=1,
)
# Right: the outlier dataset with uniform weights vs a small outlier weight
fig.add_trace(
go.Scatter(
x=a_grid,
y=mse_out,
mode="lines",
name="MSE(a) (uniform weights)",
showlegend=False,
),
row=1,
col=2,
)
fig.add_trace(go.Scatter(x=a_grid, y=wmse_down, mode="lines", name="WMSE(a) (downweight outlier)"), row=1, col=2)
fig.add_vline(x=mu_out, line_dash="dash", line_color="rgba(0,0,0,0.45)", row=1, col=2)
fig.add_vline(x=mu_w, line_dash="dash", line_color="rgba(0,0,0,0.45)", row=1, col=2)
fig.add_annotation(
x=mu_out,
y=mse_at_mu_out,
text=f"uniform mean={mu_out:.2f}",
showarrow=True,
yshift=10,
row=1,
col=2,
)
fig.add_annotation(
x=mu_w,
y=wmse_at_mu_w,
text=f"weighted mean={mu_w:.2f}",
showarrow=True,
yshift=10,
row=1,
col=2,
)
fig.update_xaxes(title_text="constant prediction a", row=1, col=1)
fig.update_yaxes(title_text="loss", row=1, col=1)
fig.update_xaxes(title_text="constant prediction a", row=1, col=2)
fig.update_yaxes(title_text="loss", row=1, col=2)
fig.update_layout(
title="Constant predictor: mean (and weighted mean) minimize squared error",
legend=dict(orientation="h"),
)
fig.show()
6) Using MSE to optimize a model: simple linear regression#
Model#
With one feature \(x\), a linear model predicts:

\[\hat{y}_i = b_0 + b_1 x_i.\]
Objective#
Minimize MSE over parameters \((b_0, b_1)\):

\[J(b_0, b_1) = \frac{1}{n}\sum_{i=1}^{n} \bigl(y_i - (b_0 + b_1 x_i)\bigr)^2.\]
Gradients#
Let \(\hat{y}_i=b_0+b_1x_i\) and residual \(r_i=y_i-\hat{y}_i\). Then

\[\frac{\partial J}{\partial b_0} = -\frac{2}{n}\sum_{i=1}^{n} r_i, \qquad \frac{\partial J}{\partial b_1} = -\frac{2}{n}\sum_{i=1}^{n} x_i r_i.\]
Gradient descent#
Update parameters with learning rate \(\eta\):

\[b_0 \leftarrow b_0 - \eta\,\frac{\partial J}{\partial b_0}, \qquad b_1 \leftarrow b_1 - \eta\,\frac{\partial J}{\partial b_1}.\]
For linear regression, there is also a closed-form solution (ordinary least squares), but gradient descent shows how MSE acts as an optimization landscape.
# Synthetic dataset
n = 150
x = rng.uniform(-3, 3, size=n)
b0_true = 1.5
b1_true = 2.0
noise = rng.normal(0, 0.8, size=n)
y = b0_true + b1_true * x + noise
fig = px.scatter(x=x, y=y, title="Synthetic regression data", labels={"x": "x", "y": "y"})
fig.show()
def mse_for_line(x, y, b0, b1):
y_hat = b0 + b1 * x
r = y - y_hat
return float(np.mean(r**2))
def mse_gradients_for_line(x, y, b0, b1):
'Return (mse, dJ/db0, dJ/db1) for y_hat = b0 + b1 x.'
y_hat = b0 + b1 * x
r = y - y_hat
n = x.shape[0]
mse = float(np.mean(r**2))
db0 = float((-2.0 / n) * np.sum(r))
db1 = float((-2.0 / n) * np.sum(x * r))
return mse, db0, db1
def fit_line_gd(x, y, *, lr=0.05, n_steps=200):
b0, b1 = 0.0, 0.0
history = {
"step": [],
"b0": [],
"b1": [],
"mse": [],
}
for step in range(n_steps):
mse, db0, db1 = mse_gradients_for_line(x, y, b0, b1)
history["step"].append(step)
history["b0"].append(b0)
history["b1"].append(b1)
history["mse"].append(mse)
b0 -= lr * db0
b1 -= lr * db1
return (b0, b1), history
(b0_gd, b1_gd), hist = fit_line_gd(x, y, lr=0.05, n_steps=200)
# Closed-form OLS (for comparison)
X = np.column_stack([np.ones_like(x), x])
b0_ols, b1_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"true params: b0={b0_true:.3f}, b1={b1_true:.3f}")
print(f"GD params: b0={b0_gd:.3f}, b1={b1_gd:.3f}, MSE={mse_for_line(x, y, b0_gd, b1_gd):.4f}")
print(f"OLS params: b0={b0_ols:.3f}, b1={b1_ols:.3f}, MSE={mse_for_line(x, y, b0_ols, b1_ols):.4f}")
fig = go.Figure()
fig.add_trace(go.Scatter(x=hist["step"], y=hist["mse"], mode="lines", name="MSE"))
fig.update_layout(title="Gradient descent on MSE", xaxis_title="step", yaxis_title="MSE")
fig.show()
# Fit line
x_line = np.linspace(x.min(), x.max(), 200)
y_hat_gd = b0_gd + b1_gd * x_line
y_hat_ols = b0_ols + b1_ols * x_line
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y, mode="markers", name="data", marker=dict(size=6, opacity=0.75)))
fig.add_trace(go.Scatter(x=x_line, y=y_hat_gd, mode="lines", name="GD fit"))
fig.add_trace(go.Scatter(x=x_line, y=y_hat_ols, mode="lines", name="OLS fit", line=dict(dash="dash")))
fig.update_layout(title="Fitted line", xaxis_title="x", yaxis_title="y")
fig.show()
true params: b0=1.500, b1=2.000
GD params: b0=1.475, b1=1.966, MSE=0.6438
OLS params: b0=1.475, b1=1.966, MSE=0.6438
# Visualize the MSE surface over (b0, b1) with the GD path
b0_grid = np.linspace(b0_ols - 2.0, b0_ols + 2.0, 160)
b1_grid = np.linspace(b1_ols - 2.0, b1_ols + 2.0, 160)
B0, B1 = np.meshgrid(b0_grid, b1_grid)
residuals = y[None, None, :] - (B0[:, :, None] + B1[:, :, None] * x[None, None, :])
Z = np.mean(residuals**2, axis=2)
stride = max(1, len(hist["step"]) // 120)
b0_path = np.array(hist["b0"])[::stride]
b1_path = np.array(hist["b1"])[::stride]
fig = go.Figure()
fig.add_trace(
go.Contour(
x=b0_grid,
y=b1_grid,
z=Z,
contours_coloring="heatmap",
colorbar=dict(title="MSE"),
)
)
fig.add_trace(
go.Scatter(
x=b0_path,
y=b1_path,
mode="lines+markers",
name="GD path",
marker=dict(size=4, color="black"),
line=dict(color="black"),
)
)
fig.add_trace(
go.Scatter(
x=[b0_ols],
y=[b1_ols],
mode="markers",
name="OLS optimum",
marker=dict(symbol="x", size=10, color="white", line=dict(color="black", width=2)),
)
)
fig.update_layout(
title="MSE landscape for a line: convex bowl + gradient descent trajectory",
xaxis_title="b0 (intercept)",
yaxis_title="b1 (slope)",
)
fig.show()
7) Practical usage (scikit-learn)#
In practice, MSE is most often used to evaluate regression models on a validation/test set:

- Lower is better.
- Because it has squared units, also report \(\mathrm{RMSE}=\sqrt{\mathrm{MSE}}\) for interpretability.
- Use sample_weight when some samples matter more or have different noise levels.
- For multioutput regression, use multioutput to get per-target values or an aggregate.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X = x.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
model = LinearRegression()
model.fit(X_tr, y_tr)
y_pred_te = model.predict(X_te)
mse_te = mean_squared_error(y_te, y_pred_te)
rmse_te = float(np.sqrt(mse_te))
# Example: weighted evaluation (here we weight samples with large |x| more)
w_te = 0.5 + np.abs(X_te[:, 0])
mse_te_w = mean_squared_error(y_te, y_pred_te, sample_weight=w_te)
print(f"Test MSE: {mse_te:.4f}")
print(f"Test RMSE: {rmse_te:.4f}")
print(f"Weighted Test MSE (w = 0.5 + |x|): {mse_te_w:.4f}")
min_v = float(min(y_te.min(), y_pred_te.min()))
max_v = float(max(y_te.max(), y_pred_te.max()))
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=y_te,
y=y_pred_te,
mode="markers",
name="pred vs true",
marker=dict(size=7, opacity=0.8),
)
)
fig.add_trace(go.Scatter(x=[min_v, max_v], y=[min_v, max_v], mode="lines", name="ideal", line=dict(dash="dash")))
fig.update_layout(
title=f"Test set predictions (MSE={mse_te:.3f}, RMSE={rmse_te:.3f})",
xaxis_title="y_true",
yaxis_title="y_pred",
)
fig.show()
Test MSE: 0.9581
Test RMSE: 0.9788
Weighted Test MSE (w = 0.5 + |x|): 0.9059
8) Pros, cons, and when to use MSE#
Pros#
Simple and widely used baseline for regression.
Differentiable and often convex (e.g. linear regression) → friendly for optimization.
Strongly penalizes large errors (useful when big misses are particularly costly).
Has a clear statistical interpretation: minimizing MSE corresponds to maximum likelihood under i.i.d. Gaussian noise.
Cons / pitfalls#
Outlier sensitive: a few large residuals can dominate the average.
Scale dependent: MSE values depend on the units of \(y\) (hard to compare across targets without normalization).
Squared units: less interpretable than RMSE or MAE.
Optimizing MSE targets the conditional mean; if you care about medians/quantiles or heavy-tailed noise, MAE / pinball loss / Huber can be better.
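To make the outlier point concrete, a small sketch (the corruption magnitude is chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(4)
y_true = rng.normal(size=200)
y_pred = y_true + rng.normal(scale=0.1, size=200)  # good predictions

mse_clean = np.mean((y_true - y_pred) ** 2)
mae_clean = np.mean(np.abs(y_true - y_pred))

# Corrupt a single target with an extreme value
y_corrupt = y_true.copy()
y_corrupt[0] += 50.0

mse_out = np.mean((y_corrupt - y_pred) ** 2)
mae_out = np.mean(np.abs(y_corrupt - y_pred))

# One extreme residual inflates MSE far more than MAE
print(f"MSE inflation factor: {mse_out / mse_clean:.1f}x")
print(f"MAE inflation factor: {mae_out / mae_clean:.1f}x")
```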
Good use cases#
Standard regression tasks where errors are roughly symmetric and not extremely heavy-tailed.
When large errors should be penalized more than small ones.
As a training loss for linear models / neural nets when Gaussian noise is a reasonable assumption.
9) Exercises#
1. Implement root_mean_squared_error_np and verify it matches sqrt(MSE).
2. Create a dataset with 1% extreme outliers. Compare how MSE vs MAE changes.
3. Extend fit_line_gd to multiple features: \(\hat{y}=b+Xw\).
4. Derive the closed-form OLS solution and compare it to gradient descent.
References#
scikit-learn docs: https://scikit-learn.org/stable/modules/model_evaluation.html#mean-squared-error
Hastie, Tibshirani, Friedman — The Elements of Statistical Learning (least squares and loss functions)